By performing a comprehensive study on 1832 segments of 1212 complete genomes of viruses, we
show that in viral genomes the hairpin structures of thermodynamically predicted RNA secondary
structures are more abundant than expected under a simple random null hypothesis. The detected
hairpin structures are present both in coding and in noncoding regions
for the four groups of viruses categorized as dsDNA, dsRNA, ssDNA and ssRNA. For all groups,
the excess of hairpin structures over the random null hypothesis is more pronounced
in noncoding than in coding regions. However, potential RNA secondary
structures are also present in coding regions of the dsDNA group. In fact, we detect evolutionarily
conserved RNA secondary structures in conserved coding and noncoding regions of a large set of
complete genomes of dsDNA herpesviruses.

PACS numbers: 87.18.-h Biological complexity, 87.15.bd Secondary structure, 87.15.Qt Sequence analysis



I. INTRODUCTION 

In recent years the discovery of the regulatory role of
short RNA sequences has changed the view of the
biological role of RNA in living organisms [1]. For a long
time it was assumed that RNA had only an ancillary
role in protein synthesis. Today biologists know several
regulatory mechanisms fully controlled by short RNA
sequences [2], often characterized by a typical secondary
structure presenting a certain number of hairpin
structures. An RNA hairpin structure is a secondary structure
in which a double-stranded region of a single-stranded RNA
is formed by base pairing between complementary base
sequences on the same strand.

RNA regulatory secondary structures have been
detected in almost all living organisms, ranging from viruses
to Homo sapiens. The earliest discoveries of their
regulatory role were made in model organisms such
as the worm C. elegans [3] and in studies of the
interaction between plants and viruses [4].

Small noncoding RNA regulatory sequences are often
characterized by the presence of hairpin structures.
Hairpin structures have been investigated in quite different
organisms with a variety of methods. Examples of these
studies can be found in Refs. [5-14].

In this paper we detect candidate RNA secondary
structures which are characterized by a minimal value
of the free energy of the structure in the folded state.
In a first investigation the free energy is estimated by
using Mfold, which is a reference software for the
estimation of the free energy of RNA secondary structures.
The free energy of the folded RNA structure is
compared with the one numerically observed for random
RNA sequences obtained by shuffling the nucleotides of
the real one. This first investigation has been performed
on a large set of 1212 complete genomes of viruses. We
have chosen to perform a comprehensive investigation of
these organisms because viruses present a variety of
genomic structures and organization and there are
indications that small RNA structures play important antiviral
roles in plants and insects. Although the details of the
interactions between viruses and the host RNA silencing
machinery remain poorly understood, there is
mounting evidence that RNA motifs may play a crucial role in
different aspects of the viral life cycle. Results about the
biological role of RNA secondary structures in different
regions of different families of viruses are known only for
a limited set of specific viruses, with a focus on their
coding regions [15] [16] [17]. It is therefore useful to perform
a comprehensive investigation covering a large number of
the complete viral genomes available today.

In a second investigation, focused on the important
viral family of Herpesviridae, we perform a search for RNA
secondary structures by using RNAz [18]. This is a
computer program based on thermodynamic and
comparative genomic indicators. It has been used to detect
evolutionarily conserved RNA secondary structures in several
organisms [19]. We apply it to a large number of complete
genomes of herpesviruses and search for candidate RNA
secondary structures in coding and noncoding
conserved genome regions.

The paper is organized as follows. In Section II we
illustrate the data set we investigate. Section III discusses
the method we use to search for RNA secondary structures.
In this section we also discuss in detail the
limits of validity of the random null hypothesis used to point
out the predicted RNA secondary structures. Section
IV presents the results about the predicted RNA
secondary structures detected in the investigated viruses
when they are grouped according to their type of nucleic
acid. Section V presents the investigation performed with
the RNAz program in the family of herpesviruses and
Section VI briefly concludes.
II. THE INVESTIGATED DATA SET



This study aims to investigate the presence of
potential RNA secondary structures in a large set of complete
genomes of viruses. For this reason, we analyze 1832
complete segments of 1212 recently available complete viral
genomes. This set of complete genomes was downloaded
from the GenBank database [20] in April 2006. Several
classification systems of viruses exist, mainly based
on phenotypic characteristics including morphology,
nucleic acid type, mode of replication, host organisms, and
the type of disease they cause. In the present study we
choose to focus on the Casjens and King [21]
classification of viruses. They classified viruses into 4 groups
essentially based on the type of nucleic acid. Specifically they
consider the following four groups: (i) double stranded
DNA viruses, (ii) single stranded DNA viruses, (iii)
double stranded RNA viruses, and (iv) single stranded RNA
viruses. Complete viral genomes can be structured in
one or more segments. Our set of completely sequenced
segments of complete genomes of viruses comprises 310
dsDNA segments, 326 ssDNA segments, 896 ssRNA
segments, and 300 dsRNA segments. The size of the
segments of viral genomes ranges from 220 bp to 1.18
Mbp. The distribution of the segment length of the
investigated viruses is quite heterogeneous, with many segments
shorter than 10^4 bp and a limited number of long
segments. An overview of the length heterogeneity of the
considered segments is obtainable from Fig. 1, where the
rank plot of the segment lengths is shown. The total
number of base pairs of our set of complete genomes is
28.56 Mbp, of which 23.92 Mbp are in coding regions
and 4.65 Mbp in noncoding regions. We point out this
aspect because our analysis for each virus is performed
by distinguishing the detected structures in coding and
noncoding regions.

To complete the information about the analyzed set
of viral segments, we provide a statistical summary of
the CG content of viral segments in Table I.
The data reported in the table show
that the average CG content and its standard deviation
are not markedly different
in coding and noncoding regions for all considered groups.
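As an illustration of the summary statistics reported in Table I, the following minimal sketch computes the CG content of each segment and then its mean and standard deviation over a group. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def cg_content(sequence):
    """Fraction of C and G nucleotides in a sequence (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("C") + seq.count("G")) / len(seq)


def cg_summary(segments):
    """Return (<CG>, std{CG}) over a list of segment sequences.

    Assumption: population standard deviation (pstdev); the paper does
    not specify the estimator.
    """
    values = [cg_content(s) for s in segments]
    return mean(values), pstdev(values)
```

In practice the same function would be applied separately to the coding and noncoding parts of each segment before grouping by nucleic acid type.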



III. THE SEARCH METHOD 

We detect candidate RNA secondary structures in
RNA sequences of the viral segments by computing the
minimum free energy structures predicted by the Mfold
(version 3.2) software [32]. This widely used software
estimates the difference between the free energy of the
folded state and that of the unfolded state of an RNA
sequence. Calculations are performed with the
temperature parameter set to 37 degrees Celsius for all sequences.

FIG. 1: Rank plot of the length of the 1832 viral segments
investigated in our study. The minimal and maximal lengths
are 220 bp and 1.18 Mbp respectively.

TABLE I: Descriptive statistics of the investigated viral
segments according to the type of nucleic acid. Results are
provided separately for coding (CDS) and noncoding (NCR)
regions. The quantity <CG> is the average value of the
CG content of viral segments and std{CG} its standard
deviation. The symbol # indicates the number of viral
segments while the Length column indicates the total length in
the considered group.

Region  group   <CG>    std{CG}   #      Length
CDS     all     0.447   0.0658    1780   23.958 Mbp
CDS     dsDNA   0.445   0.0946    303    17.3 Mbp
CDS     dsRNA   0.453   0.0693    286    0.689 Mbp
CDS     ssDNA   0.433   0.0440    323    0.709 Mbp
CDS     ssRNA   0.452   0.0580    868    5.26 Mbp
NCR     all     0.425   0.0865    1815   4.65 Mbp
NCR     dsDNA   0.396   0.102     304    3.86 Mbp
NCR     dsRNA   0.465   0.0826    300    0.081 Mbp
NCR     ssDNA   0.423   0.0598    326    0.180 Mbp
NCR     ssRNA   0.421   0.0853    885    0.531 Mbp

The investigated segments of the complete genomes
are scanned with Mfold by using a sliding window of 80
bp moving in steps of 40 bp. Each selected RNA
sequence is folded as a linear RNA sub-sequence. For each
obtained value of the free energy $\Delta G$ associated with
each investigated 80 bp RNA sequence, we estimate a
Z-score $Z = (\Delta G - \langle \Delta G_{shuf} \rangle) / \mathrm{std}\{\Delta G_{shuf}\}$ by
comparing the observed free energy $\Delta G$ with the mean value
$\langle \Delta G_{shuf} \rangle$ and standard deviation $\mathrm{std}\{\Delta G_{shuf}\}$ of the
free energy computed by performing 100 mononucleotide
shufflings of the considered 80 bp RNA sequence.
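The scanning and Z-score estimation described above can be sketched as follows. This is only an illustration under stated assumptions: `fold_energy` is a hypothetical callable standing in for a call to a folding program such as Mfold (it must return the minimum free energy of a sequence), trailing partial windows are simply dropped, and the population standard deviation is used; none of these details are specified in the text.

```python
import random
from statistics import mean, pstdev


def sliding_windows(sequence, size=80, step=40):
    """Yield (start, subsequence) for every full-length window.

    Assumption: a trailing window shorter than `size` is skipped.
    """
    for start in range(0, len(sequence) - size + 1, step):
        yield start, sequence[start:start + size]


def z_score(window, fold_energy, n_shuffles=100, rng=None):
    """Z = (dG - <dG_shuf>) / std{dG_shuf} under mononucleotide shuffling.

    `fold_energy` is a hypothetical stand-in for the folding program.
    Returns None when all shuffled energies coincide (std is zero).
    """
    rng = rng or random.Random()
    observed = fold_energy(window)
    letters = list(window)
    shuffled = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves the mononucleotide composition
        shuffled.append(fold_energy("".join(letters)))
    spread = pstdev(shuffled)
    if spread == 0:
        return None
    return (observed - mean(shuffled)) / spread
```

A negative Z-score means that the native window folds into a more stable structure than its shuffled counterparts, which is the signal used in the rest of the analysis.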

The Mfold algorithm estimates the minimum free
energy by adding a negative stacking energy of base pairs
(which stabilizes the secondary structure) and a
positive energy term (which destabilizes the secondary
structure) associated with non-complementary bases [22],
i.e. hairpin loops, interior/bulge loops and multiloops. The
stabilizing energy contribution comes from the stacking
energy between two adjacent base pairs. This is the
reason why Workman et al. [23] and Rivas et al. [21] have
concluded that an appropriate random null hypothesis
for RNA sequences should take into
account the empirically observed dinucleotide frequency.

On the other hand, investigations of RNA secondary
structures in the coding region of hepatitis C virus [16]
have shown that in members of the Flaviviridae
mononucleotide shuffling and dinucleotide shuffling produce free
energy values (and therefore associated Z-scores)
which are remarkably similar. In other words, the
computationally more convenient mononucleotide
shuffling provides results compatible with those of
dinucleotide shuffling in this family of viruses.
Supported by this observation, we have devised a test to
assess whether mononucleotide shuffling can be used as
a good proxy for dinucleotide shuffling in our
investigations. Specifically, we perform a dinucleotide shuffling
for the set of the 378 largest viral segments. The total length
of these largest segments is 22.2 Mbp, which amounts to
84% of the total length of the entire viral data set
investigated by us. They have 18.9 Mbp of coding regions
and 3.3 Mbp of noncoding regions. The dinucleotide
shuffling is performed by using the SHUFFLE program
included in the HMMER 2.2 package [25]. This program
shuffles an RNA sequence while preserving both
mononucleotide and dinucleotide composition exactly. It uses the
Altschul and Erickson algorithm [26]. To limit computer
time we estimate the relation between the Z-score
obtained with mononucleotide shuffling and the
Z-score obtained with dinucleotide shuffling
only for the sequences characterized by a mononucleotide
Z-score smaller than -2. The results of our extensive test
are shown in Fig. 2. The figure clearly shows that
mononucleotide shuffling is a good proxy of
dinucleotide shuffling in the large set of investigated viral
segments. For this reason in the rest of this paper we
will use the Z-score computed by using a mononucleotide
shuffling of the investigated sequences.
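The comparison between the two null hypotheses amounts to regressing the dinucleotide Z-score on the mononucleotide Z-score for the windows below a chosen threshold. A minimal sketch of that step is given below; it assumes an ordinary least-squares fit with an intercept (the text does not specify the fitting procedure), and the function name and arguments are hypothetical.

```python
def regression_slope(z_mono, z_dinuc, threshold=-2.0):
    """Least-squares slope of z_dinuc versus z_mono for z_mono < threshold.

    Assumption: ordinary least squares with a free intercept.
    """
    pairs = [(x, y) for x, y in zip(z_mono, z_dinuc) if x < threshold]
    if len(pairs) < 2:
        raise ValueError("need at least two points below the threshold")
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0:
        raise ValueError("all selected z_mono values are identical")
    return sxy / sxx
```

A slope close to one would indicate that the two shuffling schemes give nearly interchangeable Z-scores; repeating the call with thresholds of -3, -4 and -5 reproduces the kind of conditioning used for the figure.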



IV. RNA SECONDARY STRUCTURES IN 
DIFFERENT GROUPS OF VIRUSES 

For each RNA sequence of 80 bp we compute the
optimal secondary structure according to the Mfold software
and we associate a Z-score with each RNA sequence. The
lower the Z-score, the lower the probability
that the considered folding occurred by chance. By
performing our large scale investigation, we systematically
find a number of secondary structures characterized by a
low value of the mononucleotide Z-score.

FIG. 2: Z-score obtained with a null hypothesis based on
dinucleotide shuffling as a function of the Z-score obtained
with a null hypothesis based on mononucleotide shuffling.
Each circle represents the Z-score values obtained for an 80
bp sequence sampled from the set of the 378 largest viral
segments under the condition that the mononucleotide Z-score
is smaller than -2. The line is the linear regression of the set of
points and it is characterized by a slope equal to 0.86. When
we perform a linear regression on smaller sets of investigated
points selected by the conditions Z < -3, Z < -4 and Z < -5,
the value of the coefficient of the linear regression is 0.82,
0.77 and 0.73 respectively.

To focus on a well defined part of the secondary
structures, for each computed RNA structure of minimum free
energy, we identify all hairpin structures (HSs) that are
present in each structure. Here we present results on HSs
which have a stem length ranging from 6 to 40 bp. This
is done both for real segments and for the corresponding
null hypothesis obtained by performing a mononucleotide
shuffling of each 80 bp RNA sequence, which maintains its
nucleotide composition.
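To make the hairpin selection step concrete, the sketch below extracts hairpins from a predicted structure and keeps those with a stem length between 6 and 40 bp. It rests on several assumptions that are not in the text: the structure is given in dot-bracket notation (Mfold's native output format is different), a stem is defined as a run of directly stacked base pairs closing a hairpin loop, and bulges or interior loops terminate the stem.

```python
def pair_table(structure):
    """Map each paired position to its partner for a dot-bracket string."""
    partner = {}
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            partner[i] = j
            partner[j] = i
        elif ch != ".":
            raise ValueError("unexpected character: " + ch)
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def hairpins(structure, min_stem=6, max_stem=40):
    """Return sorted (stem_start, stem_end, stem_length) tuples.

    Assumption: the stem is the run of directly stacked pairs around a
    hairpin loop; any bulge or interior loop ends the stem.
    """
    partner = pair_table(structure)
    found = []
    for i, j in partner.items():
        if i > j:
            continue
        # (i, j) closes a hairpin loop if nothing inside it is paired
        if any(k in partner for k in range(i + 1, j)):
            continue
        left, right, length = i, j, 1
        # extend outward while the next outer pair stacks on this one
        while partner.get(left - 1) == right + 1:
            left -= 1
            right += 1
            length += 1
        if min_stem <= length <= max_stem:
            found.append((left, right, length))
    return sorted(found)
```

Counting the tuples returned for every window, once for the native sequences and once for their shuffled versions, gives the two numbers whose ratio is discussed next.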

We first analyze the number of HSs observed in real
and random sequences by conditioning on the Z-score
value associated with the folding of the RNA sequence.
The number of HSs observed in RNA sequences
characterized by a Z-score smaller than -2, -2.5, -3, -3.5, -4,
-4.5 and -5 is given in Table II. Results are summarized
separately for coding and noncoding regions. Our data
clearly indicate that the number of HSs detected in the
real viral segments is much higher than the one observed
for the corresponding null hypothesis obtained for
random segments generated with mononucleotide shuffling
of real data. A higher number than the one expected
under the null hypothesis is observed both in coding and
in noncoding regions. This feature is more pronounced
when we condition the analysis on RNA sequences
characterized by low values of the Z-score. In fact, the lower
the mononucleotide Z-score, the larger the HS-ratio
of the number of HSs observed in the native segments to the
number of HSs observed in the random segments. This
result is much more pronounced in noncoding than in
coding regions. In fact, when we condition the analysis on
RNA sequences with Z-score smaller than -5 we observe
a ratio of 8.0 in the coding regions whereas we observe
a ratio of 63 in noncoding regions. In summary, HSs
are much more abundant in viruses than expected under
a simple but representative null hypothesis. The
abundance is quite remarkable in noncoding regions, where the
HS-ratio reaches a value as high as 63 when the analysis
is conditioned on low values of the Z-score of the RNA
sequence.

TABLE II: Number of HSs detected in coding and noncoding regions of all segments of investigated viruses as a function of the
Z-score of each investigated 80 bp sequence. For comparison we also report the number of HSs detected in random sequences
obtained by mononucleotide shuffling of the real ones. The HS-ratio is the ratio between the number of HSs detected in real
data and the number of HSs detected in random sequences.

source                       region  group  Z < -2  Z < -2.5  Z < -3  Z < -3.5  Z < -4  Z < -4.5  Z < -5
real data                    CDS     all    27514   14726     7604    3805      1923    1015      539
shuffled data                CDS     all    14461   6424      2709    1094      416     160       67
HS-ratio real/shuffled data  CDS     all    1.90    2.29      2.81    3.48      4.62    6.34      8.0
real data                    NCR     all    8721    5563      3676    2373      1615    1076      752
shuffled data                NCR     all    2782    1177      515     227       97      39        12
HS-ratio real/shuffled data  NCR     all    3.13    4.73      7.14    10.4      16.6    28        63
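The HS-ratio is a simple quotient of counts, and the values in Table II can be recomputed directly from the reported numbers of HSs. The short sketch below does this; the counts are those of Table II, while the function name and the `min_count` parameter are hypothetical (the cutoff of 10 reflects the rule, stated later in the text, that ratios based on fewer than 10 HSs are not shown).

```python
THRESHOLDS = [-2.0, -2.5, -3.0, -3.5, -4.0, -4.5, -5.0]

# HS counts from Table II, one entry per Z-score threshold
REAL_CDS = [27514, 14726, 7604, 3805, 1923, 1015, 539]
SHUF_CDS = [14461, 6424, 2709, 1094, 416, 160, 67]
REAL_NCR = [8721, 5563, 3676, 2373, 1615, 1076, 752]
SHUF_NCR = [2782, 1177, 515, 227, 97, 39, 12]


def hs_ratio(real_counts, shuffled_counts, min_count=10):
    """Ratio of HSs in real to shuffled data for each threshold.

    Entries where either count is below `min_count` are returned as None,
    because such ratios carry a large estimation error.
    """
    ratios = []
    for real, shuf in zip(real_counts, shuffled_counts):
        if real < min_count or shuf < min_count:
            ratios.append(None)
        else:
            ratios.append(real / shuf)
    return ratios
```

For instance, `hs_ratio(REAL_NCR, SHUF_NCR)` gives the noncoding row of the table, with the last entry being 752/12.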

The results summarized in Table II for the complete
group of viral segments can be analyzed in more detail in
terms of the 4 groups we use to classify the investigated
viruses. Several investigations of ssRNA viruses such as
the hepatitis G virus [15], the hepatitis C virus [16]
and viruses of the family Flaviviridae [17] have concluded
that RNA secondary structures are present in the coding
regions of these viruses and that some of them are also
evolutionarily conserved. It is therefore of interest to check
in our set whether the secondary structures detected
in coding regions are present only
in ssRNA viruses or also in the other
groups of viruses. In Table III we summarize the HS-ratio
for the 4 groups we use in our classification, namely
dsDNA, dsRNA, ssDNA and ssRNA. The Table shows that
the HS-ratio of ssDNA or ssRNA segments is higher than
the one observed for dsDNA or dsRNA segments. In the
Table we do not show the HS-ratio when it
is computed with a number of HSs detected in the
real or shuffled segments smaller than 10, because
these values carry a large estimation error.
The analysis of the Table shows a different behavior of
the groups of viruses with respect to coding and
noncoding regions. Specifically, in noncoding regions the values
of the HS-ratio are significantly higher than in coding
regions. For example, when Z < -4, in the coding regions
of dsDNA viruses (the group where the best statistics is
achieved because these viruses are characterized
by long genomes) we observe a HS-ratio equal to
3.02, whereas in noncoding regions the HS-ratio is equal
to 16. A significant difference is also observed inside the
same region (coding or noncoding) when we distinguish
among the different groups. For example, when Z < -3.5,
in coding regions dsDNA viruses are characterized by a HS-ratio
of 2.47 whereas ssRNA viruses have a HS-ratio of 6.49. The
other groups, dsRNA and ssDNA, have intermediate
values. Similarly, in noncoding regions dsDNA viruses are
characterized by a HS-ratio of 9.11 whereas ssRNA viruses have a
HS-ratio of 12. Table III shows that the coding regions of
ssRNA and ssDNA viruses present a HS-ratio which is
significantly higher than the one observed in dsDNA and
dsRNA viruses. It is known in the literature that RNA
secondary structures with known or potential biological
role are present in the coding regions of ssRNA viruses.


It is worth noting that the HS-ratio in dsDNA and
dsRNA viruses is nevertheless significantly larger than one.
This is an indication that the RNA secondary structures
where these HSs are located might also have a
potential biological role. To provide comparative
evidence that these values are significantly higher than one,
we have performed our analysis on one chromosome of a
model organism. Specifically, we have investigated
chromosome V of C. elegans. When we perform our
analysis on the coding regions of this chromosome we obtain
the following values for the HS-ratio: 1.17 when Z < -2,
1.22 when Z < -2.5, 1.16 when Z < -3, 1.10 when
Z < -3.5, 1.10 when Z < -4, 1.62 when Z < -4.5 and
1.86 when Z < -5. It is worth noting that the noncoding
regions of chromosome V of C. elegans also present
high values of the HS-ratio. In fact, when we perform our
analysis on these regions we obtain 2.41 when Z < -2,
3.32 when Z < -2.5, 5.05 when Z < -3, 7.75 when
Z < -3.5, 13.6 when Z < -4, 20.8 when Z < -4.5 and
38 when Z < -5. The analysis of the HS-ratio values
obtained in coding and noncoding regions of
chromosome V of C. elegans shows that the HS-ratio values
obtained in coding regions are significantly smaller than
the ones we have observed for the dsDNA and dsRNA
viruses. This observation has motivated us to search for
the presence of conserved RNA secondary structures in
dsDNA viruses.
TABLE III: HS-ratio for coding and noncoding regions of all viral segments conditioned on the Z-score value of the investigated
sequence. We summarize the results also by grouping the viral segments according to their nucleic acid. The HS-ratio is not
shown when the number of detected HSs is smaller than 10 in real or random data.

source    region  group  Z < -2  Z < -2.5  Z < -3  Z < -3.5  Z < -4  Z < -4.5  Z < -5
HS ratio  CDS     all    1.90    2.29      2.81    3.48      4.62    6.34      8.0
HS ratio  CDS     dsDNA  1.60    1.82      2.09    2.47      3.02    4.19      4.9
HS ratio  CDS     dsRNA  1.91    2.26      2.8     4.6       -       -         -
HS ratio  CDS     ssDNA  2.54    3.23      4.7     5.3       7.6     -         -
HS ratio  CDS     ssRNA  2.77    3.73      5.06    6.49      10      14        19
HS ratio  NCR     all    3.13    4.73      7.14    10.4      16.6    28        63
HS ratio  NCR     dsDNA  2.83    4.27      6.22    9.11      16      30        58
HS ratio  NCR     dsRNA  2.3     2.3       -       -         -       -         -
HS ratio  NCR     ssDNA  5.20    9.3       26      -         -       -         -
HS ratio  NCR     ssRNA  5.33    7.78      12      12        -       -         -

V. CONSERVED RNA SECONDARY 
STRUCTURES IN THE HERPESVIRUS FAMILY 

The Herpesviridae are a family of large encapsulated
DNA viruses. Herpesvirus genomes are circular or
linear dsDNA up to approximately 250 kb in length,
containing between approximately 70 and 220 genes. The
Herpesviridae are divided into three subfamilies: alpha,
beta, and gamma herpesviruses [27] [28], according to their
host range, cytopathology, and molecular phylogenetic
analysis. All three groups have been found in primates
including humans.

We analyze here 21 completely sequenced genomes of
herpesviruses and 1 large fragment. The herpesvirus
genomes were downloaded from GenBank [20] in July 2007.
The set consists of 13 alpha
herpesviruses, 3 beta herpesviruses and 6 gamma
herpesviruses. The phylogenetic information about these
viruses that we use in our analysis is derived from McGeoch
et al. [29] and Davison et al. [30]. In our analysis, we
cluster herpesvirus genomes in 6 subgroups according to
their genera (alpha-1, alpha-2, alpha-3, etc.; for details
see Table IV).

Specifically, we investigate 6 subgroups of genomes,
each containing between 3 and 5 viral genomes. The
analysis is performed as follows. We use a modified
version of the Multiz-Tba package [31] to generate a
whole-genome alignment for each of the 6 groups of genomes.
The Tba program uses a phylogenetic tree and the
corresponding sequences as inputs, and outputs the
resulting whole-genome alignment. High sequence
conservation is not strictly needed for the biological function of RNA
secondary structures [32] [33] and therefore we consider
both medium and highly conserved genome regions in our
analysis.

We use RNAz (version 1.0) [18] to detect consensus
secondary structures for each of the above described
herpesvirus groups. Alignments are sliced in overlapping
windows of size 120 nt with steps of 40 nt. Each series of
windows starts at the beginning of a TBA block. All
sequences with a gap content greater than 25%
are discarded from the alignment before analysis.
Furthermore, we discard all sequences with a content of masked
letters greater than 10%. This criterion is used for
excluding repeat sequences marked by RepeatMasker [34].
RNAz is currently limited to the analysis of alignments of up to
six sequences. Finally, we use RNAz to analyze both the
forward and the reverse complement sequences.
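The slicing and gap filtering of alignment blocks can be illustrated with the following sketch. It is not the preprocessing code used with RNAz; it assumes an alignment block is given as a mapping from a sequence name to an aligned string with '-' as the gap character, that the gap fraction is evaluated per window, and that trailing partial windows are dropped. The function and parameter names are hypothetical.

```python
def alignment_windows(alignment, size=120, step=40, max_gap_fraction=0.25):
    """Yield (start, window) pairs for one alignment block.

    `alignment` maps a sequence name to its aligned string ('-' = gap).
    Within each window, sequences whose gap fraction exceeds
    `max_gap_fraction` are dropped from that window.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    total = lengths.pop()
    for start in range(0, total - size + 1, step):
        window = {}
        for name, seq in alignment.items():
            piece = seq[start:start + size]
            if piece.count("-") / size <= max_gap_fraction:
                window[name] = piece
        yield start, window
```

A filter on masked letters (the 10% criterion above) could be added in the same loop, once the convention used to mark masked positions is fixed.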

This kind of analysis might produce a certain
number of false positive RNA secondary structures. We
estimate the expected number of false positives by performing
the following procedure. We use the program
rnazRandomizeAln to shuffle the positions in an alignment. This
shuffling removes any correlations arising from a native
secondary structure and produces a random alignment of
the same length with the same base composition, sequence
conservation, and gap patterns. The procedure is
conservative, providing a number of false positives higher than
the one expected under a simpler random null hypothesis
[T%] . This procedure therefore gives us an estimate of the 
false-positive rate expected for a specific input alignment. 
The program tries to maintain local conservation pattern 
by shuffling only columns of the same degree of conserva- 
tion, i.e. by shuffling the columns which show the same 
mean pairwise identity. We therefore repeat the complete 
analysis with randomized alignment blocks. 
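The idea of shuffling only columns with the same degree of conservation can be sketched as follows. This is a simplified illustration and not the algorithm implemented in rnazRandomizeAln: columns are grouped by their exact mean pairwise identity, a gap is treated like any other character when computing identity, and columns are then permuted within each group. All names are hypothetical.

```python
import random
from itertools import combinations


def column_identity(column):
    """Mean pairwise identity of one alignment column.

    Assumption: a gap character is compared like any other symbol.
    """
    pairs = list(combinations(column, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def shuffle_alignment(rows, rng=None):
    """Permute alignment columns among columns of equal mean identity.

    `rows` is a list of at least two aligned strings of equal length.
    """
    if len(rows) < 2 or len({len(r) for r in rows}) != 1:
        raise ValueError("need at least two rows of equal length")
    rng = rng or random.Random()
    columns = list(zip(*rows))
    groups = {}
    for idx, col in enumerate(columns):
        groups.setdefault(round(column_identity(col), 6), []).append(idx)
    shuffled = list(columns)
    for indices in groups.values():
        permuted = indices[:]
        rng.shuffle(permuted)
        for target, source in zip(indices, permuted):
            shuffled[target] = columns[source]
    return ["".join(chars) for chars in zip(*shuffled)]
```

Because whole columns are moved, each sequence keeps its base composition and gap count, and the conservation profile along the alignment is unchanged, while any covariation between paired positions is destroyed.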

Sequence similarities between the species belonging to
the same subgroup are mapped and used to determine
the level of genome conservation between the viruses.
A relatively high number of similar regions are
conserved within genera and a much lower number are conserved
among members of different genera. According to the
sequence comparison method used, the alpha-1 and
alpha-3 herpesvirus groups clearly share more coding regions
(CDS) with detectable sequence homology than the other
groups. We observe that the percentage of conserved
CDS within the alpha-1 herpesvirus genus is close to 94%
(see Table IV). Note that the percentage reported in the
table is computed on the conserved region length without
removing gaps or insertions. This percentage ranges
between 80% and 89% for the alpha-3 herpesvirus group. The
gamma-2 genus has the lowest number of conserved
coding regions. Only approximately 40% of the total coding
regions are conserved. We also estimate the percentage of
conserved noncoding regions between the genomes within
genera. For example, within the gamma-2 group,
approximately 2% of the noncoding regions are conserved. The
alpha-1 group contains the highest percentage of
conserved noncoding regions.

TABLE IV: 6 herpesvirus groups of complete genomes belonging to different herpesvirus genera. Each distinct group is
investigated with the RNAz program.

subfamily  genus    species                   CDS length (bp)  conserved CDS (%)  NCR length (bp)  conserved NCR (%)
alpha      alpha 1  Human herpesvirus 2       122543           93                 32203            55
alpha      alpha 1  Human herpesvirus 1       121248           94                 31013            54
alpha      alpha 1  Cercopit. herpesvirus 1   119287           94                 37502            47
alpha      alpha 1  Cercopit. herpesvirus 16  118774           94                 37713            46
alpha      alpha 1  Cercopit. herpesvirus 2   117121           94                 33594            47
alpha      alpha 2  Equid herpesvirus 4       124329           64                 21268            14
alpha      alpha 2  Equid herpesvirus 1       125445           63                 24779            12
alpha      alpha 2  Suid herpesvirus 1        108501           66                 34960            8
alpha      alpha 2  Bovine herpesvirus 5      116183           66                 21638            21
alpha      alpha 2  Bovine herpesvirus 1      114897           67                 20404            16
alpha      alpha 3  Gallid herpesvirus 3      134015           86                 30255            30
alpha      alpha 3  Gallid herpesvirus 2      137562           80                 40312            40
alpha      alpha 3  Meleagrid herpesvirus 1   131798           89                 27362            32
beta       beta 1   Chimp. cytomegalovirus    192875           68                 48212            28
beta       beta 1   Human herpesvirus 5       186974           69                 48671            27
beta       beta 1   Cercopit. herpesvirus 8   174220           66                 47234            32
gamma      gamma 1  Cercopit. herpesvirus 15  118815           73                 52281            11
gamma      gamma 1  Human herpesvirus 4       132121           67                 39702            8
gamma      gamma 1  Callitric. herpesvirus 3  104103           83                 45593            12
gamma      gamma 2  Porcine herpesviruses 1   64095            52                 9105             3.5
gamma      gamma 2  Alcelap. herpesvirus 1    104327           32                 26281            1.0
gamma      gamma 2  Equid herpesvirus 2       108637           31                 75790            0.7

The aim of our study is to select evolutionarily
conserved motifs in subgroups of the Herpesviridae family.
The threshold value we use for the "RNA class
probabilities" is P=0.9 [18]. This is a rather stringent threshold
value and it is useful to enhance the quality of the
prediction. We observe that the number of conserved structures
varies between herpesvirus subgroups. We find several
conserved secondary structures in the coding regions of
the gamma-1, beta-1A, alpha-1 and alpha-3 herpesvirus
groups. In particular, in the coding regions of the
gamma-1 group we detect 29 conserved structures (see Table V).
In the coding regions of the beta-1A group we detect 25
conserved structures and 19 and 15 in the coding regions of
the alpha-1 and alpha-3 herpesvirus groups respectively.
In coding regions of dsDNA viruses we therefore observe
a number of conserved RNA secondary structures which
have a potential biological role. The presence of
evolutionarily conserved RNA secondary structures is therefore
not limited to ssRNA viruses.

We have extended our investigation of evolutionarily
conserved RNA secondary structures to the conserved
noncoding regions which are present in the 6 genera of
investigated herpesviruses. Our investigation shows that,
in this case, we observe only a small number of conserved
structures in the noncoding regions. For example, the
number of conserved structures in the noncoding
regions of the alpha-1 herpesvirus subgroup is only 3 (see
Table VI), in spite of the fact that this group has a
relatively high percentage of conserved NCRs.

To assess whether the detected conserved RNA
secondary structures have a known biological role, we
compare all the conserved RNA structures with the sequences
of the Rfam Database [35] and of the miRBase Database
[36]. The miRBase Database contains 4584 hairpin
precursor miRNAs (miRBase release 9.2, May 2007),
expressing approximately 4700 mature miRNA products,
in primates, rodents, birds, fish, worms, flies, plants and
viruses. The Rfam Database contains 32897 sequences
of known ncRNAs (Rfam version 8.1, April 2007). We
detect only 6 of the 4584 miRNA precursors in a global
survey of the 6 herpesvirus genera. Our set of conserved
RNA structures has no overlap with the structures
annotated in the Rfam database. Therefore, the large
majority of the 103 RNA secondary structures detected in
coding regions and of the 16 RNA secondary structures
detected in noncoding regions have a completely unknown
biological role.

TABLE V: RNAz hits with P>0.9 in coding regions of the 6 herpesvirus groups

subfamily  genus    predicted   estimated false  Comparison with  Comparison with
                    structures  positives        Rfam database    miRBase database
alpha      alpha 1  15          4                -                1
alpha      alpha 2  3           2                -                1
alpha      alpha 3  19          5                -                -
beta       beta 1   25          9                -                2
gamma      gamma 1  29          11               -                2
gamma      gamma 2  12          -                -                -

TABLE VI: RNAz hits with P>0.9 in noncoding regions of the 6 herpesvirus groups

subfamily  genus    predicted   estimated false  Comparison with  Comparison with
                    structures  positives        Rfam database    miRBase database
alpha      alpha 1  3           1                -                -
alpha      alpha 2  -           -                -                -
alpha      alpha 3  8           -                -                -
beta       beta 1   5           -                -                -
gamma      gamma 1  -           1                -                -
gamma      gamma 2  -           -                -                -



VI. CONCLUSION 

Our study shows that in viral genomes the hairpin
structures of RNA secondary structures are more
abundant than expected under a simple random null
hypothesis. The random null hypothesis we use is admittedly
simple, but we have verified that it is a reliable proxy of
the more accurate null hypothesis based on dinucleotide
shuffling for the set of genomes we investigate. The
detected hairpin structures
are present both in coding and in noncoding regions for
the four groups of viruses we investigate in our
comprehensive study. For all groups, the excess of hairpin
structures over the random null hypothesis is more
pronounced in noncoding than in coding regions.
Differently from our
previous results reported in [13], we observe that the
detected hairpin structures are preferentially located in the
noncoding regions. Therefore an approach based on
combinatorial considerations such as the one of Ref. [13] and
a thermodynamically based approach like the present one
lead to different conclusions.



The amount of excess observed in coding regions is also
of interest. This excess is in agreement with recent
observations of the presence of evolutionarily conserved RNA
secondary structures in ssRNA viruses such as the
hepatitis G virus [15], the hepatitis C virus [16] and
viruses of the family Flaviviridae [17]. Our results
indicate that this conclusion is valid not only for ssRNA
viruses, but that a similar behavior should also be
observed for other groups of viruses such as, for example,
dsDNA viruses. To support this conclusion we have
performed with the program RNAz a search for
evolutionarily conserved RNA secondary structures in conserved
coding and noncoding regions of a large set of complete
genomes of herpesviruses. We detect a significant number
of evolutionarily conserved RNA secondary structures in conserved
coding regions of herpesviruses, while only a few
structures are detected in conserved noncoding regions. In
summary, a large number of potential RNA secondary
structures are predicted both in coding and noncoding
regions of all groups of viruses, with a clear preference
for a location in noncoding regions. However, at least
in herpesviruses, the degree of evolutionary conservation
of these structures is more pronounced in coding than in
noncoding regions.